Non-asymptotic Properties of Individualized Treatment Rules from Sequentially Rule-Adaptive Trials

Learning optimal individualized treatment rules (ITRs) has become increasingly important in the modern era of precision medicine. Many statistical and machine learning methods for learning optimal ITRs have been developed in the literature. However, most existing methods are based on data collected from traditional randomized controlled trials and thus cannot take advantage of the accumulating evidence when patients enter the trials sequentially. It is also ethically important that future patients have a high probability of being treated optimally based on the knowledge accumulated so far. In this work, we propose a new design called sequentially rule-adaptive trials to learn optimal ITRs based on the contextual bandit framework, in contrast to the response-adaptive design in traditional adaptive trials. In our design, each entering patient is allocated with high probability to the current best treatment for this patient, which is estimated from the past data by some machine learning algorithm (for example, outcome weighted learning in our implementation). We explore the tradeoff between training and test values of the estimated ITR in single-stage problems by proving theoretically that, for a higher probability of following the estimated ITR, the training value converges to the optimal value at a faster rate, while the test value converges at a slower rate. This problem differs from traditional decision problems in that the training data are generated sequentially and are dependent. We also develop a tool that combines martingale theory with empirical process theory to tackle this problem, which cannot be solved by previous techniques for i.i.d. data. We show by numerical examples that, without much loss of the test value, our proposed algorithm can improve the training value significantly compared to existing methods. Finally, we use a real data study to illustrate the performance of the proposed method.

The following lemma gives an analogue of Talagrand's inequality for martingale processes. A $Z$-valued tree $z$ of depth $n$ is a rooted complete binary tree whose nodes are labeled by elements of $Z$. The tree $z = (z_1, \dots, z_n)$ is a sequence of labeling functions such that $z_i : \{\pm 1\}^{i-1} \to Z$. Let $\eta = (\eta_1, \dots, \eta_n)$ be a sequence of i.i.d. Rademacher random variables. Then the sequential Rademacher complexity of a function class $F \subset \mathbb{R}^Z$ on a $Z$-valued tree $z$ is defined as
$$R_n(F, z) := E \sup_{f \in F} \frac{1}{n} \sum_{i=1}^n \eta_i f(z_i(\eta)).$$
Further, define $R_n(F) := \sup_z R_n(F, z)$, and Rakhlin et al. (2015) showed that
where $D = \inf_{z \in Z} \sup_{f, f' \in F} [f(z) - f'(z)] \ge 0$. This indicates that $R_n(F)$ and the expectation of the martingale process supremum $\sup_P E \sup_{f \in F} M_n(f)$ are on the same scale. The covering numbers are also extended to sequential data. A set $V$ of $\mathbb{R}$-valued trees of depth $n$ is a (sequential) $\epsilon$-cover with respect to the $L_p$-norm of $F \subset \mathbb{R}^Z$ on a tree $z$ of depth $n$ if for any $f \in F$ and any $\eta \in \{\pm 1\}^n$, there exists $v \in V$ such that
$$\left( \frac{1}{n} \sum_{i=1}^n |v_i(\eta) - f(z_i(\eta))|^p \right)^{1/p} \le \epsilon.$$
The sequential covering number of a function class $F$ on a given tree $z$ is defined as
$$N_p(\epsilon, F, z) = \min \{ |V| : V \text{ is an } \epsilon\text{-cover with respect to the } L_p\text{-norm of } F \text{ on } z \}.$$
Moreover, define the maximal $L_p$ covering number of $F$ over depth-$n$ trees as $N_p(\epsilon, F, n) = \sup_z N_p(\epsilon, F, z)$.
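For a finite function class and a small depth, the sequential Rademacher complexity on a fixed tree can be computed exactly by enumerating all $2^n$ sign paths. The sketch below assumes the normalization $R_n(F, z) = E \sup_{f \in F} \frac{1}{n} \sum_{i} \eta_i f(z_i(\eta))$ of Rakhlin et al. (2015); the function name, the example tree, and the two-function class are hypothetical choices made only for illustration.

```python
from itertools import product


def sequential_rademacher(functions, tree, n):
    """Exact sequential Rademacher complexity of a finite class on one tree.

    functions: list of callables f(z) -> float (the class F, assumed finite)
    tree: callable tree(i, prefix) -> z, giving the label reached at level i
          after the signs in `prefix` (a tuple of length i) have been observed
    n: depth of the tree
    """
    total = 0.0
    for eta in product((-1, 1), repeat=n):  # all 2^n equally likely sign paths
        # Labels along this path; the label at level i only sees earlier signs.
        labels = [tree(i, eta[:i]) for i in range(n)]
        # Supremum over the class of the signed average along the path.
        best = max(
            sum(e * f(z) for e, z in zip(eta, labels)) / n
            for f in functions
        )
        total += best
    return total / 2 ** n


def count_plus_tree(i, prefix):
    """Illustrative tree: the label is the number of +1 signs seen so far."""
    return sum(1 for s in prefix if s == 1)


def plus_one(z):
    return 1.0


def minus_one(z):
    return -1.0


two_constants = sequential_rademacher([plus_one, minus_one], count_plus_tree, 2)
```

For the class made of the two constant functions, the supremum equals $|\sum_i \eta_i| / n$ on every path, so at depth 2 the value is 0.5 whatever tree is supplied. For a class with a single function the complexity is zero, because each $\eta_i$ is independent of the label it multiplies.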
Lemma 13 (Lemma 15 in Rakhlin et al. 2015) For $F \subset [-1, 1]^Z$, for $n \ge 2$ and any $t > 0$, we have that
under the mild assumptions $R_n(F) \ge 1/n$ and $N_\infty(2^{-1}, F, n) \ge 4$. Here $c$ is an absolute constant and $L > e^4$ is such that
From the above lemma, we can see that the concentration inequality essentially relies on the sequential Rademacher complexity $R_n(F)$, which can be upper and lower bounded by functions of $E \sup_{f \in F} M_n(f)$.
To obtain a bound on the supremum of a martingale process over a finite set, we make use of a martingale inequality and a result on the $L_\psi$-Orlicz norm. Freedman's inequality is an extension of Bernstein's inequality to martingale difference sequences.
Lemma 14 (Freedman's inequality, Freedman 1975) Suppose $\{X_i\}_{i \ge 1}$ is a $G_i$-adapted martingale difference sequence and $S_n = \sum_{i=1}^n X_i$. Then for all $t > 0$,
The following lemma gives a bound on the expectation of suprema over a finite set using the $L_\psi$-Orlicz norm.
Lemma 15 (Van der Vaart and Wellner 1996) Suppose that $X_1, \dots, X_n$ are arbitrary random variables satisfying the probability tail bound
for all $t > 0$ and $i = 1, \dots, n$ for fixed positive numbers $c$ and $d$. Then there is a universal constant such that
where the $L_\psi$-Orlicz norm is defined as $\|X\|_\psi = \inf \{ c > 0 : E\psi(|X|/c) \le 1 \}$ for any random variable $X$, and $\psi_p(x) = e^{x^p} - 1$ is a Young modulus for each $p \ge 1$.
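The Orlicz norm definition $\|X\|_\psi = \inf\{c > 0 : E\psi(|X|/c) \le 1\}$ can be evaluated numerically for a random variable with finite support, since $E\psi_p(|X|/c)$ is decreasing in $c$. The following is a minimal sketch using bisection; the function name `orlicz_norm` and its arguments are hypothetical, and the support probabilities are assumed to sum to one.

```python
import math


def orlicz_norm(values, probs, p=1, tol=1e-10):
    """psi_p-Orlicz norm of a discrete random variable, found by bisection.

    Uses psi_p(x) = exp(x**p) - 1 and ||X|| = inf{c > 0 : E psi_p(|X|/c) <= 1}.
    values, probs: support points of X and their probabilities.
    """
    def excess(c):
        # E psi_p(|X|/c) - 1, which is decreasing in c.
        try:
            return sum(q * (math.exp((abs(v) / c) ** p) - 1.0)
                       for v, q in zip(values, probs)) - 1.0
        except OverflowError:
            return math.inf  # c is far too small to be feasible

    m = max(abs(v) for v in values)
    if m == 0:
        return 0.0
    # c = m / (log 2)^(1/p) is always feasible because psi_p(m / c) = 1.
    hi = m / math.log(2.0) ** (1.0 / p)
    lo = hi / 2.0
    while excess(lo) <= 0.0:  # shrink until the constraint is violated
        lo /= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return hi


rademacher_psi2 = orlicz_norm([-1.0, 1.0], [0.5, 0.5], p=2)
```

For a Rademacher variable, $|X| = 1$ almost surely, so the constraint reads $e^{1/c^2} - 1 \le 1$ and the $\psi_2$ norm is $1/\sqrt{\log 2}$; the bisection result can be compared against this closed form.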
The following inequality is a key step in the proof of Lemma 6. It bounds the expectation of the suprema of a martingale process over a finite set, after which the bound on a general set can be derived.
Proof First we rewrite Lemma 14 in the form of a martingale process. For a $G_i$-adapted
Scale both sides by a factor of $\sqrt{n}$ and we get
where $t > 0$ and $M_n = \sum_{i=1}^n \mathrm{Var}[f(Y_i) \mid G_{i-1}]$. The result follows by applying inequality (10) to Lemma 15 and expanding the $L_\psi$-Orlicz norm.
The following lemma shows how the dependence of $\pi_i$ on $\hat f_{i-1}$ can be canceled by the sampling probability so that it reduces to a constant term.
We have that $\pi_i(\hat f_{i-1}(X_i); H_{i-1}, X_i)$ is a fixed function of $H_{i-1}$ and $X_i$, and that it equals $E[1(I_i = 1) \mid H_{i-1}, X_i]$. The same is true for the second term in the bracket. Therefore, the right-hand side equals $E[2G(X_i) \mid G_{i-1}]$. The result follows from the assumption that $X_i$ is independent of the history. The first equality in (11) can be proved by taking the expectation of both sides of the equation.
Similarly, when $G$ is divided by the square term of $\pi_i$ in (12),
Since one of $\pi_i(\hat f_{i-1}(X_i); H_{i-1}, X_i)$ and $\pi_i(-\hat f_{i-1}(X_i); H_{i-1}, X_i)$ must be lower bounded by 0.5 and the other one is lower bounded by $\epsilon_i$, the inequality follows. Now for the first term in (12), its upper bound can be proved by taking the expectation of both sides of the inequality.
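The cancellation used in this proof can be checked with exact arithmetic in the simplest case. The sketch below is an illustration under stated assumptions, not the lemma itself: it takes a binary treatment, sets $G \equiv 1$, and writes `p_follow` (a hypothetical name) for the probability of assigning the arm recommended by the current rule, so that the other arm has probability `1 - p_follow`.

```python
from fractions import Fraction


def ipw_moments(p_follow):
    """Inverse-probability-weighted moments for a binary arm choice.

    p_follow: probability of drawing the recommended arm, strictly in (0, 1).
    Returns (E[1/pi(A)], E[1/pi(A)^2]), where pi(A) is the probability of
    the arm A that was actually drawn.
    """
    p = Fraction(p_follow)
    if not 0 < p < 1:
        raise ValueError("p_follow must lie strictly between 0 and 1")
    q = 1 - p
    first = p * (1 / p) + q * (1 / q)            # each arm contributes 1
    second = p * (1 / p ** 2) + q * (1 / q ** 2)  # equals 1/p + 1/q
    return first, second


# Exploration level eps: the less likely arm has probability eps.
eps = Fraction(1, 10)
first_moment, second_moment = ipw_moments(1 - eps)
```

The first moment equals 2 for every sampling probability, which is the sense in which the dependence on the estimated rule cancels. The second moment equals $1/p + 1/(1-p)$, and it is at most $2 + 1/\epsilon$ when one arm has probability at least 0.5 and the other at least $\epsilon$.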

Appendix B. Proof of Lemma 6
The proof essentially follows the proof of Van der Vaart and Wellner (1996, Theorem 2.5.6), the bracketing entropy Donsker theorem, with an extension to martingale sequences. In this proof, we assume that there exists a constant $\tau^2$ such that the second conditional moment $E(R^2 \mid X, A) \le \tau^2$. This also implies that the conditional variance $\mathrm{Var}(R \mid X, A)$ is bounded by $\tau^2$. However, we will show that $\tau^2$ does not appear in the dominating term of the final bound.
Proof Define the $L_{2,\infty}(P)$ norm as $\|f\|_{P,2,\infty} = \sup_{x > 0} [x^2 P(|f(X)| > x)]^{1/2}$. Note that the $L_{2,\infty}(P)$ norm is not actually a norm, but it can be shown that there is a norm equivalent to it up to a constant multiple. The assumption (4) implies that
because $\|f\|_{P,2} \ge \|f\|_{P,2,\infty}$ for any measurable function $f$, and we have $N_{[]}(\eta, F, L_2(P)) \ge N(\eta, F, L_2(P))$ for any function class $F$. For each positive integer $q$, define a bracketing number $N_q^1 := N_{[]}(2^{-q}, F, L_{2,\infty}(P))$ and a covering number $N_q^2 := N(2^{-q}, F, L_2(P))$. Then there are two partitions
Take the intersection of the two partitions that correspond to the bracketing number and the covering number respectively. The total number of sets will be $N_q := N_q^1 N_q^2$ and this joint partition $\{F_{qj}\}_{j=1}^{N_q}$ satisfies the combined conditions:
Furthermore, the sequence of partitions can be chosen to be nested. To see this, consider a sequence of partitions $\{\bar F_{qj}\}_{j=1}^{\bar N_q}$ that are possibly not nested. Take the partition at stage $q$ to consist of all intersections of the form $\bigcap_{p=1}^q \bar F_{p, i_p}$. Then this generates $N_q = \bar N_1 \cdots \bar N_q$ sets. Conditions (13)-(15) continue to hold since $(\log \prod_{p=1}^q \bar N_p)^{1/2} \le \sum_{p=1}^q (\log \bar N_p)^{1/2}$.
Now for each $q$, fix a function $f_{qj} \in F_{qj}$ to be the representative of the set $F_{qj}$ and let $\xi$ be the function of choosing the representative. In addition, let $\Delta$ be the function of finding the "size" of the set that a function belongs to. Then we have
Note that $\xi_q h_f$ and $\Delta_q h_f$ form sets of only $N_q$ functions when $h_f$ ranges over $F$.
We will actually approximate each $h_f$ with $\xi_q h_f$ and $\Delta_q h_f$. While $F$ may be infinite, $\xi_q h_f$ and $\Delta_q h_f$ run over finite sets.
Let $\mathrm{Log}(x) := 1 + \log(x)$. For each fixed $n$ and $q_0$, define truncation levels $a_q$ and indicator functions $A_q$, $B_q$ for $q \ge q_0$ as
Since the partitions are nested, the functions $A_q$ and $B_q$ are constant in $f$ on each set $F_{qj}$ at level $q$. The key observation here is that
pointwise in $x$. To see this, note that either $B_q f = 0$ for all $q$ or there is a unique $q_1$ such that $B_{q_1} f = 1$. In the former case, the first two terms are all zero and the third term has canceling components and converges to $f - \xi_{q_0} f$. In the latter case, the right-hand side of (16) is equivalent to $h_f - \xi_{q_1} h_f + \sum_{q=q_0+1}^{q_1} (\xi_q h_f - \xi_{q-1} h_f)$, and the result follows.
Write $\|M_n(f)\|_F$ for the supremum of $|M_n(f)|$ as $f$ ranges over $F$. Then $E^* \sup_{f \in F} W_n(f)$ can be bounded as
To bound the first term (17), note that for any function class $H$ with some envelope function $H$,
by the definitions of the envelope function $F$ and the indicator function $B_{q_0}$. Therefore,
The equality comes from Lemma 17 and the third line is true since the $X_i$'s are i.i.d. Choose $q_0$ such that $2^{-q_0} = \delta \|F\|_{P,2}$ for some $\delta > 0$. Then
For any function class $H$ with some envelope function $H$, we can bound $|M_n h|$ by

the second term (18) can be bounded by
By Corollary 16, for each $q$ in the first term in (22), the expectation can be split into two parts:
Since $\Delta_q f B_q f \le \Delta_{q-1} f A_{q-1} f \le \sqrt{n} a_{q-1}$, the $L_\infty$ term in the first part can be bounded by
for any $f \in F$. For the second part, by the assumption on the second conditional moment,
The last inequality comes from Lemma 17. For any non-negative random variable $X$, we have the inequality $\|X\|_{2,\infty}^2 \le \sup_{t > 0} t E[X 1(X > t)] \le 2 \|X\|_{2,\infty}^2$. Then
Since $\Delta_q f B_q f$ is bounded by $\sqrt{n} a_{q-1}$ for $q > q_0$, it follows that
Using Lemma 17 again and the inequality (25), the second term in (22) can be bounded as
Now apply the above bounds (23), (24), (26) to (18) to find
The last inequality comes from the fact that $a_{q-1}/a_q \le (a_{q-1}/a_q)^2$ for decreasing $a_q$.
To handle the third term (19), first note that it is bounded by
since the partition is nested. Then we can use Corollary 16 as in the bound of the second term (18). The maximum of the $L_\infty$ norm over $F$ in the first part is upper bounded by $r \sqrt{n} a_{q-1} / \epsilon_n$. For the second part, use Lemma 17 and assumption (15) to find
Combining the two parts together, we have
For the last term (20), consider two cases on whether the envelope function $F$ is bounded by $\sqrt{n} a_{q_0}$ or not. Apply Corollary 16 in the first case. With the supremum part bounded in terms of $\sqrt{n} a_{q_0}$ and the conditional variance part bounded by
we have
In the second case, since $\xi_{q_0} h_f 1(F > \sqrt{n} a_{q_0})$ is bounded by $2r$
by the same argument as in the bounds for (21). Therefore, by applying the triangle inequality and choosing $q_0$ so that $2^{-q_0} = \delta \|F\|_{P,2}$ for some constant $\delta > 0$,
where $N_{q_0} = N_{[]}(\delta \|F\|_{P,2}, F, L_2(P))$. Finally, combine the four upper bounds (21), (27), (28) and (29) to get

Appendix C. Proof of Theorem 1
Using Lemma 13 and Lemma 6, we can give our proof of Theorem 1.
Apart from applying the bound on the expectation of supremum to the concentration inequality of martingale process, we also combine the pilot trial with the main trial which follows the adaptive design.
, which is the opposite of the excess 0-1 risk. For the initial $n_0$ i.i.d. observations, we know that the excess 0-1 risk is bounded by the excess $\phi$-risk, that is,
for any $i = 1, \dots, n_0$ and any measurable $f$ by Theorem 3.2 in Zhao et al. (2012). For sequentially generated data $\{Z_i\}_{i=1}^n$, note that conditioning on $G_{i-1}$,
for any $i = 1, \dots, n$ and any measurable $f$. The inequality can be proved as in the i.i.d. case of Theorem 3.2 in Zhao et al. (2012), but conditionally on $G_{i-1}$. Therefore, the value function difference $V(f^*) - V(\hat f_n)$ is upper bounded by
In SRAT, $\hat f_n$ should be minimizing
It follows that
Now it suffices to bound the right-hand side of (30). We will use Lemma 13 to bound the martingale part. First we check the conditions of the lemma. Since $R_n(F)$ and $\sup_P E \sup_{f \in F} M_n(f)$ are on the same scale and the latter is of order $1/\sqrt{n}$, the first assumption in Lemma 13 is satisfied. The second one can be satisfied by taking a large class $F$, for example, a linear class with loosely bounded parameters.
Let $H(F)$ be the class of functions $h_f$ as $f$ ranges over $F$. According to (8),
Since $R$ and $h_f$ can take the value zero, $D = 0$ here. Therefore, $R_n(H(F))$ is bounded by $r J_{[]}(\|F\|_{P,2}, F, L_2(P)) / (\sqrt{n}\, \epsilon_n)$ up to a constant by Lemma 6. Since
is upper bounded by $2rb/\epsilon_n$ for all $i$ and all $f \in F$, scale (9) and we get
for some constant $C$ and any $t > 0$. In other words,
for some constant $C$ and any $\delta > 0$.
To derive a bound for the initial randomized treatments of size $n_0$, we make use of a variant of Talagrand's inequality (Talagrand, 1994) in Lemma 12, which is a common approach in i.i.d. classification problems. In our setting,
is bounded by $4b^2r^2$. The key step here is to bound $\mu^*$ in Lemma 12. By Theorem 2.14.2 in Van der Vaart and Wellner (1996), the expectation of the supremum of an empirical process is bounded by the bracketing integral. Following a proof similar to that of Lemma 6, with only Freedman's inequality (Freedman, 1975) replaced by Bernstein's inequality, we can bound $\mu^*$ in terms of $r$, $J$ and $\sqrt{n_0}$, since $J$ is the supremum of bracketing integrals over all possible measures. Therefore,
Now by the triangle inequality and the fact that $P(|X + Y| \ge a + b) \le P(|X| \ge a) + P(|Y| \ge b)$, inequality (33), whose right-hand side is $e^{-\delta}$, holds for some constant $C$ and any $\delta > 0$. The result on test data follows by combining inequalities (30) and (33).

Appendix D. Proof of Theorem 7
Proof First note that
Given $G_{i-1}$ and $I_i$, $E(R_i \mid G_{i-1}, I_i)$ is actually $V(\hat f_{i-1} I_i)$. Then $E_{i-1} R_i$ can be written as
where $p_i(H_{i-1}, X_i)$ is the probability of $I_i = 1$. So the second term on the right-hand side of (34) is upper bounded by
For the first term, note that $\{R_i - E_{i-1} R_i\}_{i=1}^n$ is a martingale difference sequence. We will use Freedman's inequality in Lemma 14. The two parameters can be bounded as
Setting the right-hand side equal to $e^{-\delta}$ gives the result.
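The last step, equating a Freedman-type tail bound to $e^{-\delta}$ and solving for the deviation, can be sketched numerically. The code assumes the commonly stated form $P(S_n \ge t, V_n \le v) \le \exp\{-t^2 / (2(v + bt/3))\}$ for increments bounded by $b$ and predictable variance at most $v$; the constants in Lemma 14 may differ, and the function names are hypothetical.

```python
import math


def freedman_tail(t, v, b):
    """Tail bound exp(-t^2 / (2 (v + b t / 3))) in the assumed Freedman form.

    t: deviation level, v: bound on the summed conditional variances,
    b: almost-sure bound on the martingale increments.
    """
    return math.exp(-t * t / (2.0 * (v + b * t / 3.0)))


def freedman_deviation(delta, v, b):
    """Deviation t at which freedman_tail(t, v, b) equals exp(-delta).

    Positive root of t^2 - (2 b delta / 3) t - 2 v delta = 0.
    """
    half = b * delta / 3.0
    return half + math.sqrt(half * half + 2.0 * v * delta)


# Illustrative parameters: confidence level delta, variance proxy v, bound b.
delta, v, b = 3.0, 50.0, 1.0
t_star = freedman_deviation(delta, v, b)
```

Since $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$, the root satisfies $t \le \sqrt{2 v \delta} + 2 b \delta / 3$, which is the familiar shape of a variance term plus a range term holding with probability at least $1 - e^{-\delta}$.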